Despite huge investments in advertising, the Procrastinate Pro+ entertainment application has been losing money for the last few months. The task is to understand the reasons and help the company return to profitability. We have data on customers acquired from May 1 to October 27, 2019:
Data for the study:
There are three datasets. The visits_info_short.csv file stores the server log with information about site visits, orders_info_short.csv — information about orders, and costs_info_short.csv — information about advertising expenses.
visits_info_short.csv structure:
orders_info_short.csv structure:
costs_info_short.csv structure:
We will load the data about visits, orders, and advertising expenses from the CSV files into variables.
Then we will examine the data and perform preprocessing: check for missing values and duplicates.
# importing the libraries needed for work
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from matplotlib import pyplot as plt
import seaborn as sns
import os
# define a file-search function for Linux and Windows folders (in case the notebook runs locally)
def file_path(name):
    # try to read the data file locally: 4 candidate paths for the Yandex server and for Windows;
    # the file should sit next to the notebook or in the nested "datasets" directory
    for p in ('/datasets', '', os.path.abspath(""), os.path.abspath("") + '\\datasets\\'):
        pth1 = os.path.join(p, name)
        if os.path.exists(pth1):
            break  # if the search succeeds, exit the loop early
    else:
        pth1 = ''
        print(f'The file {name} was not found. Check its name and make sure',
              'it is present in the directory next to the notebook or in a subfolder "/datasets"')
    print('Reading data from:', pth1, '\n')
    return pth1
# reading 3 data files
name = 'visits_info_short.csv' # session log
sessions = pd.read_csv(file_path(name)) # writing data file to the dataframe
name = 'orders_info_short.csv' # purchases
orders = pd.read_csv(file_path(name)) # writing data file to the dataframe
name = 'costs_info_short.csv' # ad costs
costs = pd.read_csv(file_path(name)) # writing data file to the dataframe
Reading data from: visits_info_short.csv 
Reading data from: orders_info_short.csv 
Reading data from: costs_info_short.csv 
frames_list = [sessions, orders, costs] # list of 3 dataframes
frames_names = ['sessions', 'orders', 'costs'] # also put their names on the list
for i in range(3):  # check for missing values
    display(frames_names[i], frames_list[i].isna().sum())
'sessions'

User Id          0
Region           0
Device           0
Channel          0
Session Start    0
Session End      0
dtype: int64

'orders'

User Id     0
Event Dt    0
Revenue     0
dtype: int64

'costs'

dt         0
Channel    0
costs      0
dtype: int64
for i in range(3):  # check if there are duplicates
    display(frames_names[i], frames_list[i].duplicated().sum())
'sessions'
0
'orders'
0
'costs'
0
# for the 3 dataframes, output the first 5 rows (and take a closer look at the column names)
for i in range(3):
    display(frames_names[i], frames_list[i].head())
'sessions'
|   | User Id | Region | Device | Channel | Session Start | Session End |
|---|---|---|---|---|---|---|
| 0 | 981449118918 | United States | iPhone | organic | 2019-05-01 02:36:01 | 2019-05-01 02:45:01 |
| 1 | 278965908054 | United States | iPhone | organic | 2019-05-01 04:46:31 | 2019-05-01 04:47:35 |
| 2 | 590706206550 | United States | Mac | organic | 2019-05-01 14:09:25 | 2019-05-01 15:32:08 |
| 3 | 326433527971 | United States | Android | TipTop | 2019-05-01 00:29:59 | 2019-05-01 00:54:25 |
| 4 | 349773784594 | United States | Mac | organic | 2019-05-01 03:33:35 | 2019-05-01 03:57:40 |
'orders'
|   | User Id | Event Dt | Revenue |
|---|---|---|---|
| 0 | 188246423999 | 2019-05-01 23:09:52 | 4.99 |
| 1 | 174361394180 | 2019-05-01 12:24:04 | 4.99 |
| 2 | 529610067795 | 2019-05-01 11:34:04 | 4.99 |
| 3 | 319939546352 | 2019-05-01 15:34:40 | 4.99 |
| 4 | 366000285810 | 2019-05-01 13:59:51 | 4.99 |
'costs'
|   | dt | Channel | costs |
|---|---|---|---|
| 0 | 2019-05-01 | FaceBoom | 113.3 |
| 1 | 2019-05-02 | FaceBoom | 78.1 |
| 2 | 2019-05-03 | FaceBoom | 85.8 |
| 3 | 2019-05-04 | FaceBoom | 136.4 |
| 4 | 2019-05-05 | FaceBoom | 122.1 |
# the column names are inconsistent; rename them to snake_case
frames_list[0].columns = ['user_id','region','device','channel','session_start','session_end']
frames_list[1].columns = ['user_id', 'event_dt', 'revenue']
frames_list[2].columns = ['dt', 'channel', 'costs']
for i in range(3):  # data types in each dataframe
    display(frames_names[i], frames_list[i].info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309901 entries, 0 to 309900
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   user_id        309901 non-null  int64 
 1   region         309901 non-null  object
 2   device         309901 non-null  object
 3   channel        309901 non-null  object
 4   session_start  309901 non-null  object
 5   session_end    309901 non-null  object
dtypes: int64(1), object(5)
memory usage: 14.2+ MB

'sessions'
None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40212 entries, 0 to 40211
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   user_id   40212 non-null  int64  
 1   event_dt  40212 non-null  object 
 2   revenue   40212 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 942.6+ KB

'orders'
None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1800 entries, 0 to 1799
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   dt       1800 non-null   object 
 1   channel  1800 non-null   object 
 2   costs    1800 non-null   float64
dtypes: float64(1), object(2)
memory usage: 42.3+ KB

'costs'
None
# convert the time strings (object dtype) to datetime
sessions['session_start'] = pd.to_datetime(sessions['session_start'])
sessions['session_end'] = pd.to_datetime(sessions['session_end'])
orders['event_dt'] = pd.to_datetime(orders['event_dt'])
costs['dt'] = pd.to_datetime(costs['dt']).dt.date # for costs, keep only the date part
# the date ranges show that the expense records end 4 days before the last sessions,
# and someone's session on the last day ended at 1:38 a.m. on November 1; there are no anomalies
print(sessions['session_start'].min(), sessions['session_start'].max())
print(sessions['session_end'].min(), sessions['session_end'].max())
print(orders['event_dt'].min(), orders['event_dt'].max())
print(costs['dt'].min(), costs['dt'].max())
2019-05-01 00:00:41 2019-10-31 23:59:23
2019-05-01 00:07:06 2019-11-01 01:38:46
2019-05-01 00:28:11 2019-10-31 23:56:56
2019-05-01 2019-10-27
Functions for calculating metric values:

- get_profiles() — to create user profiles
- get_retention() — to calculate Retention Rate
- get_conversion() — to calculate conversion
- get_ltv() — to calculate LTV

Functions for plotting (all defined below):

- filter_data() — to smooth data with a moving average
- plot_conversion() — to plot conversion
- plot_retention() — to plot Retention Rate
- plot_retention_3() — to plot retention grouped by three features
- plot_ltv_roi() — to plot LTV and ROI
# get_profiles() — function for creating user profiles
def get_profiles(sessions):  # 1st, simplified version, used in section 3.1
    # sort sessions by user ID and visit date,
    # group by ID and take the first values of session_start and channel;
    # the column with the time of the first visit is named first_ts
    profiles = (
        sessions.sort_values(by=['user_id', 'session_start'])
        .groupby('user_id')
        .agg({'session_start': 'first', 'channel': 'first'})
        .rename(columns={'session_start': 'first_ts'})
        .reset_index()  # return user_id from the index
    )
    # determine the date of the first visit and the first day
    # of the month in which it occurred; needed for cohort analysis
    profiles['dt'] = profiles['first_ts'].dt.date
    profiles['month'] = profiles['first_ts'].astype('datetime64[M]')
    return profiles
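A compatibility note: astype('datetime64[M]') truncates timestamps to month starts in older pandas, but pandas 2.x removed this cast. A version-safe equivalent, shown as a sketch on hypothetical timestamps rather than the project files:

import pandas as pd

# two hypothetical first-visit timestamps
first_ts = pd.Series(pd.to_datetime(['2019-05-07 20:58:57', '2019-10-02 00:07:44']))
# truncate to the first day of the month; works in both pandas 1.x and 2.x
month = first_ts.dt.to_period('M').dt.to_timestamp()
print(month.tolist())  # [Timestamp('2019-05-01'), Timestamp('2019-10-01')]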
# version 2 — full (used in sections 3.2-3.4)
def get_profiles_2(sessions, orders, events, ad_costs, event_names=[]):
    # sort sessions by user ID and visit date,
    # group by ID and take the parameters of the first visit
    profiles = (
        sessions.sort_values(by=['user_id', 'session_start'])
        .groupby('user_id')
        .agg(
            {
                'session_start': 'first',
                'channel': 'first',
                'device': 'first',
                'region': 'first',
            }
        )
        # call the time of the first visit first_ts
        .rename(columns={'session_start': 'first_ts'})
        .reset_index()  # return user_id from the index
    )
    # for cohort analysis, determine the date of the first visit
    # and the first day of the month in which it occurred
    profiles['dt'] = profiles['first_ts'].dt.date
    profiles['month'] = profiles['first_ts'].astype('datetime64[M]')
    # add a paying-user flag
    profiles['payer'] = profiles['user_id'].isin(orders['user_id'].unique())
    # add flags for all events from event_names
    for event in event_names:
        if event in events['event_name'].unique():
            # check whether each user appears among those who performed the event
            profiles[event] = profiles['user_id'].isin(
                events.query('event_name == @event')['user_id'].unique()
            )
    # count the number of unique users with the same
    # acquisition source and acquisition (registration) date
    new_users = (
        profiles.groupby(['dt', 'channel'])
        .agg({'user_id': 'nunique'})
        # the column with the number of users will be called unique_users
        .rename(columns={'user_id': 'unique_users'})
        .reset_index()  # return dt and channel from the index
    )
    # combine advertising spend and the number of acquired users
    # by date and acquisition channel
    ad_costs = ad_costs.merge(new_users, on=['dt', 'channel'], how='left')
    # divide advertising expenses by the number of acquired users;
    # store the result in the acquisition_cost (CAC) column
    ad_costs['acquisition_cost'] = ad_costs['costs'] / ad_costs['unique_users']
    # add the acquisition cost to the profiles
    profiles = profiles.merge(
        ad_costs[['dt', 'channel', 'acquisition_cost']],
        on=['dt', 'channel'],
        how='left',
    )
    # organic users are not linked to any advertising data,
    # so their acquisition_cost is NaN;
    # replace it with zero, because their acquisition cost is zero
    profiles['acquisition_cost'] = profiles['acquisition_cost'].fillna(0)
    return profiles  # return profiles with CAC
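To make the CAC arithmetic concrete, here is a toy example with made-up numbers (the channel name and values are hypothetical, not from the project data): 100 of spend on a channel-day that brought 4 new users gives each of them an acquisition_cost of 25.

import pandas as pd

# hypothetical one-day spend and new-user count for one channel
ad_costs = pd.DataFrame({'dt': ['2019-05-01'], 'channel': ['SomeAds'], 'costs': [100.0]})
new_users = pd.DataFrame({'dt': ['2019-05-01'], 'channel': ['SomeAds'], 'unique_users': [4]})
ad_costs = ad_costs.merge(new_users, on=['dt', 'channel'], how='left')
ad_costs['acquisition_cost'] = ad_costs['costs'] / ad_costs['unique_users']
print(ad_costs['acquisition_cost'].iloc[0])  # 25.0, the per-user acquisition cost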
# get_conversion() — to calculate conversion
def get_conversion(
    profiles,
    purchases,
    observation_date,
    horizon_days,
    dimensions=[],
    ignore_horizon=False,
):
    # exclude users who have not "lived" to the analysis horizon
    last_suitable_acquisition_date = observation_date
    if not ignore_horizon:
        last_suitable_acquisition_date = observation_date - timedelta(
            days=horizon_days - 1
        )
    result_raw = profiles.query('dt <= @last_suitable_acquisition_date')
    # determine the date and time of the first purchase for each user
    first_purchases = (
        purchases.sort_values(by=['user_id', 'event_dt'])
        .groupby('user_id')
        .agg({'event_dt': 'first'})
        .reset_index()
    )
    # add purchase data to the profiles
    result_raw = result_raw.merge(
        first_purchases[['user_id', 'event_dt']], on='user_id', how='left'
    )
    # calculate the lifetime at each purchase
    result_raw['lifetime'] = (
        result_raw['event_dt'] - result_raw['first_ts']
    ).dt.days
    # group into a single cohort if dimensions is empty
    if len(dimensions) == 0:
        result_raw['cohort'] = 'All users'
        dimensions = dimensions + ['cohort']

    # helper that groups the table by the requested attributes
    def group_by_dimensions(df, dims, horizon_days):
        # build a "triangular" conversion table
        result = df.pivot_table(
            index=dims, columns='lifetime', values='user_id', aggfunc='nunique'
        )
        # take the cumulative sum along each row
        result = result.fillna(0).cumsum(axis=1)
        # calculate cohort sizes
        cohort_sizes = (
            df.groupby(dims)
            .agg({'user_id': 'nunique'})
            .rename(columns={'user_id': 'cohort_size'})
        )
        # add cohort sizes to the conversion table
        result = cohort_sizes.merge(result, on=dims, how='left').fillna(0)
        # divide each "cell" in a row by the cohort size to get the conversion rate
        result = result.div(result['cohort_size'], axis=0)
        # exclude all lifetimes beyond the analysis horizon
        result = result[['cohort_size'] + list(range(horizon_days))]
        # restore cohort sizes
        result['cohort_size'] = cohort_sizes
        return result

    # get the conversion table
    result_grouped = group_by_dimensions(result_raw, dimensions, horizon_days)
    # for the conversion dynamics table, remove 'cohort' from dimensions
    if 'cohort' in dimensions:
        dimensions = []
    # get the conversion dynamics table
    result_in_time = group_by_dimensions(
        result_raw, dimensions + ['dt'], horizon_days
    )
    # return both tables and the raw data
    return result_raw, result_grouped, result_in_time
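A quick worked example of the cutoff above: with the observation date of November 1, 2019 and a 14-day horizon, the last cohort included was acquired on October 19, 2019, so its lifetime days 0 through 13 all fall before the observation date. A sketch using the same formula as the function:

from datetime import datetime, timedelta

observation_date = datetime(2019, 11, 1).date()
horizon_days = 14
# day 0 counts as the first day, hence "horizon_days - 1"
print(observation_date - timedelta(days=horizon_days - 1))  # 2019-10-19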
# get_retention() — to calculate Retention Rate
def get_retention(
    profiles,
    sessions,
    observation_date,
    horizon_days,
    dimensions=[],
    ignore_horizon=False,
):
    # add the payer column to the passed dimensions list
    dimensions = ['payer'] + dimensions
    # exclude users who have not "lived" to the analysis horizon
    last_suitable_acquisition_date = observation_date
    if not ignore_horizon:
        last_suitable_acquisition_date = observation_date - timedelta(
            days=horizon_days - 1
        )
    result_raw = profiles.query('dt <= @last_suitable_acquisition_date')
    # collect the "raw" data for the retention calculation
    result_raw = result_raw.merge(
        sessions[['user_id', 'session_start']], on='user_id', how='left'
    )
    result_raw['lifetime'] = (
        result_raw['session_start'] - result_raw['first_ts']
    ).dt.days

    # helper that groups the table by the requested attributes
    def group_by_dimensions(df, dims, horizon_days):
        result = df.pivot_table(
            index=dims, columns='lifetime', values='user_id', aggfunc='nunique'
        )
        cohort_sizes = (
            df.groupby(dims)
            .agg({'user_id': 'nunique'})
            .rename(columns={'user_id': 'cohort_size'})
        )
        result = cohort_sizes.merge(result, on=dims, how='left').fillna(0)
        result = result.div(result['cohort_size'], axis=0)
        result = result[['cohort_size'] + list(range(horizon_days))]
        result['cohort_size'] = cohort_sizes
        return result

    # get the retention table
    result_grouped = group_by_dimensions(result_raw, dimensions, horizon_days)
    # get the retention dynamics table
    result_in_time = group_by_dimensions(
        result_raw, dimensions + ['dt'], horizon_days
    )
    # return both tables and the raw data
    return result_raw, result_grouped, result_in_time
# get_ltv() — to calculate LTV
def get_ltv(
    profiles,
    purchases,
    observation_date,
    horizon_days,
    dimensions=[],
    ignore_horizon=False,
):
    # exclude users who have not "lived" to the analysis horizon
    last_suitable_acquisition_date = observation_date
    if not ignore_horizon:
        last_suitable_acquisition_date = observation_date - timedelta(
            days=horizon_days - 1
        )
    result_raw = profiles.query('dt <= @last_suitable_acquisition_date')
    # add purchase data to the profiles
    result_raw = result_raw.merge(
        purchases[['user_id', 'event_dt', 'revenue']], on='user_id', how='left'
    )
    # calculate the user's lifetime at each purchase
    result_raw['lifetime'] = (
        result_raw['event_dt'] - result_raw['first_ts']
    ).dt.days
    # group into a single cohort if dimensions is empty
    if len(dimensions) == 0:
        result_raw['cohort'] = 'All users'
        dimensions = dimensions + ['cohort']

    # helper that groups the table by the requested attributes
    def group_by_dimensions(df, dims, horizon_days):
        # build a "triangular" revenue table
        result = df.pivot_table(
            index=dims, columns='lifetime', values='revenue', aggfunc='sum'
        )
        # take the cumulative revenue along each row
        result = result.fillna(0).cumsum(axis=1)
        # calculate the cohort sizes
        cohort_sizes = (
            df.groupby(dims)
            .agg({'user_id': 'nunique'})
            .rename(columns={'user_id': 'cohort_size'})
        )
        # combine cohort sizes with the revenue table
        result = cohort_sizes.merge(result, on=dims, how='left').fillna(0)
        # LTV: divide each "cell" in a row by the cohort size
        result = result.div(result['cohort_size'], axis=0)
        # exclude all lifetimes beyond the analysis horizon
        result = result[['cohort_size'] + list(range(horizon_days))]
        # restore cohort sizes
        result['cohort_size'] = cohort_sizes
        # collect user CAC values into a separate frame,
        # keeping the attributes from dimensions
        cac = df[['user_id', 'acquisition_cost'] + dims].drop_duplicates()
        # average CAC over the attributes from dimensions
        cac = (
            cac.groupby(dims)
            .agg({'acquisition_cost': 'mean'})
            .rename(columns={'acquisition_cost': 'cac'})
        )
        # ROI: divide LTV by CAC
        roi = result.div(cac['cac'], axis=0)
        # drop rows that became infinite after dividing by a zero CAC
        roi = roi[~roi['cohort_size'].isin([np.inf])]
        # restore cohort sizes in the ROI table
        roi['cohort_size'] = cohort_sizes
        # add CAC to the ROI table
        roi['cac'] = cac['cac']
        # keep cohort sizes, CAC, and ROI only for lifetimes
        # within the analysis horizon
        roi = roi[['cohort_size', 'cac'] + list(range(horizon_days))]
        # return the LTV and ROI tables
        return result, roi

    # get the LTV and ROI tables
    result_grouped, roi_grouped = group_by_dimensions(
        result_raw, dimensions, horizon_days
    )
    # for the dynamics tables, remove 'cohort' from dimensions
    if 'cohort' in dimensions:
        dimensions = []
    # get the LTV and ROI dynamics tables
    result_in_time, roi_in_time = group_by_dimensions(
        result_raw, dimensions + ['dt'], horizon_days
    )
    return (
        result_raw,      # raw data
        result_grouped,  # LTV table
        result_in_time,  # LTV dynamics table
        roi_grouped,     # ROI table
        roi_in_time,     # ROI dynamics table
    )
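The ROI arithmetic inside get_ltv() boils down to dividing cumulative LTV by average CAC. A toy illustration with made-up numbers: if a cohort earns 1.20 per user by lifetime day 13 and cost 1.00 per user to acquire, its 14-day ROI is 120%, above the payback level.

ltv_day14 = 1.20  # hypothetical cumulative revenue per user on lifetime day 13
cac = 1.00        # hypothetical average acquisition cost per user
roi = ltv_day14 / cac
print(f'ROI = {roi:.0%}')  # ROI = 120%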
# filter_data() — to smooth data
def filter_data(df, window):
    # apply a moving average to every column
    for column in df.columns.values:
        df[column] = df[column].rolling(window).mean()
    return df
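Note that filter_data() overwrites the columns of the frame it receives, so pass a copy when the original values are still needed. A small usage sketch with made-up numbers:

import pandas as pd

df = pd.DataFrame({'metric': [1.0, 2.0, 3.0, 4.0, 5.0]})
smoothed = filter_data(df.copy(), window=2)  # .copy() keeps df intact
print(smoothed['metric'].tolist())  # [nan, 1.5, 2.5, 3.5, 4.5]; NaN until the window fills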
# plot_conversion() — to plot conversion
def plot_conversion(conversion, conversion_history, horizon, window=7):
    # set the grid size for the graphs
    plt.figure(figsize=(15, 5))
    # exclude cohort sizes
    conversion = conversion.drop(columns=['cohort_size'])
    # here horizon is a list of several horizons, so we keep all lifetime
    # columns (a single horizon would be cut with [[horizon - 1]])
    conversion_history = conversion_history.drop(columns=['cohort_size'])
    # first graph: conversion curves
    ax1 = plt.subplot(1, 2, 1)
    conversion.T.plot(grid=True, ax=ax1)
    plt.legend()
    plt.xlabel('Lifetime')
    plt.title('User conversion')
    # second graph: conversion dynamics
    ax2 = plt.subplot(1, 2, 2, sharey=ax1)
    columns = [
        # pivot table columns: all index columns except the date
        name for name in conversion_history.index.names if name not in ['dt']
    ]
    for hor in horizon:  # modified to accept a list of horizons
        filtered_data = conversion_history.pivot_table(
            index='dt', columns=columns, values=hor - 1, aggfunc='mean'
        )
        filter_data(filtered_data, window).plot(grid=True, ax=ax2)
    jj = str(horizon).strip('][')
    plt.xlabel('Date of acquisition')
    plt.title(f'User conversion dynamics on day {jj}')  # modified
    plt.tight_layout()
    plt.show()
# plot_retention() — to plot Retention Rate
def plot_retention(retention, retention_history, horizon, window=7):
    # set the grid size for the graphs
    plt.figure(figsize=(15, 5))
    # exclude cohort sizes
    retention = retention.drop(columns=['cohort_size'])
    # here horizon is a list of several horizons, so we keep all lifetime
    # columns (a single horizon would be cut with [[horizon - 1]])
    retention_history = retention_history.drop(columns=['cohort_size'])
    # first graph: retention curves
    ax1 = plt.subplot(1, 2, 1)
    retention.T.plot(grid=True, ax=ax1)
    plt.legend()
    plt.xlabel('Lifetime')
    plt.title('User retention')
    # second graph: retention dynamics
    ax2 = plt.subplot(1, 2, 2, sharey=ax1)
    columns = [
        # pivot table columns: all index columns except the date
        name for name in retention_history.index.names if name not in ['dt']
    ]
    for hor in horizon:  # modified to accept a list of horizons
        filtered_data = retention_history.pivot_table(
            index='dt', columns=columns, values=hor - 1, aggfunc='mean'
        )
        filter_data(filtered_data, window).plot(grid=True, ax=ax2)
    jj = str(horizon).strip('][')
    plt.xlabel('Date of acquisition')
    plt.title(f'Dynamics of user retention on day {jj}')  # modified
    plt.tight_layout()
    plt.show()
# plot_retention_3() — retention curves grouped by three features:
# paid channels in the left column, organic in the right column
def plot_retention_3(profiles, sessions, date, horizon, size=(8, 15)):
    profiles_noorg = profiles.query('channel != "organic"')  # remove organic
    profiles_org = profiles.query('channel == "organic"')    # keep only organic
    # set the grid size for the graphs
    plt.figure(figsize=size)
    # 1st graph: grouping by region for paid-channel clients
    retention_raw, retention, retention_history = get_retention(
        profiles_noorg, sessions, date, horizon, dimensions=['region']
    )
    retention = retention.drop(columns=['cohort_size'])  # exclude cohort sizes
    ax1 = plt.subplot(3, 2, 1)  # retention curves go to cell 1
    retention.T.plot(grid=True, ax=ax1)
    plt.legend()
    plt.xlabel('Lifetime')
    plt.title('Retention of paid-channel users by region')
    # 3rd graph: grouping by device for paid-channel clients
    retention_raw, retention, retention_history = get_retention(
        profiles_noorg, sessions, date, horizon, dimensions=['device']
    )
    retention = retention.drop(columns=['cohort_size'])  # exclude cohort sizes
    ax3 = plt.subplot(3, 2, 3)  # retention curves go to cell 3
    retention.T.plot(grid=True, ax=ax3)
    plt.legend()
    plt.xlabel('Lifetime')
    plt.title('Retention of paid-channel users by device')
    # 5th graph: grouping by channel for paid-channel clients
    retention_raw, retention, retention_history = get_retention(
        profiles_noorg, sessions, date, horizon, dimensions=['channel']
    )
    retention = retention.drop(columns=['cohort_size'])  # exclude cohort sizes
    ax5 = plt.subplot(3, 2, 5)  # retention curves go to cell 5
    retention.T.plot(grid=True, ax=ax5)
    plt.legend()
    plt.xlabel('Lifetime')
    plt.title('Retention of paid-channel users by channel')
    # 2nd graph: grouping by region for "organics"
    retention_raw, retention, retention_history = get_retention(
        profiles_org, sessions, date, horizon, dimensions=['region']
    )
    retention = retention.drop(columns=['cohort_size'])  # exclude cohort sizes
    ax2 = plt.subplot(3, 2, 2)  # retention curves go to cell 2
    retention.T.plot(grid=True, ax=ax2)
    plt.legend()
    plt.xlabel('Lifetime')
    plt.title('Retention of organic users by region')
    # 4th graph: grouping by device for "organics"
    retention_raw, retention, retention_history = get_retention(
        profiles_org, sessions, date, horizon, dimensions=['device']
    )
    retention = retention.drop(columns=['cohort_size'])  # exclude cohort sizes
    ax4 = plt.subplot(3, 2, 4)  # retention curves go to cell 4
    retention.T.plot(grid=True, ax=ax4)
    plt.legend()
    plt.xlabel('Lifetime')
    plt.title('Retention of organic users by device')
    # 6th graph: grouping by channel for "organics"
    retention_raw, retention, retention_history = get_retention(
        profiles_org, sessions, date, horizon, dimensions=['channel']
    )
    retention = retention.drop(columns=['cohort_size'])  # exclude cohort sizes
    ax6 = plt.subplot(3, 2, 6)  # retention curves go to cell 6
    retention.T.plot(grid=True, ax=ax6)
    plt.legend()
    plt.xlabel('Lifetime')
    plt.title('Retention of organic users by channel')
    plt.tight_layout()
    plt.show()
# plot_ltv_roi() — to plot LTV and ROI
def plot_ltv_roi(ltv, ltv_history, roi, roi_history, horizon, window=1):
    # set the grid for drawing the graphs
    plt.figure(figsize=(20, 20))
    # exclude cohort sizes from the ltv table
    ltv = ltv.drop(columns=['cohort_size'])
    # in the ltv dynamics table, keep only the needed lifetime
    ltv_history = ltv_history.drop(columns=['cohort_size'])[[horizon - 1]]
    # record the acquisition cost in a separate frame
    cac_history = roi_history[['cac']]
    # exclude cohort sizes and cac from the roi table
    roi = roi.drop(columns=['cohort_size', 'cac'])
    # in the roi dynamics table, keep only the needed lifetime
    roi_history = roi_history.drop(columns=['cohort_size', 'cac'])[
        [horizon - 1]
    ]
    # first graph: ltv curves
    ax1 = plt.subplot(3, 2, 1)
    ltv.T.plot(grid=True, ax=ax1)
    plt.legend()
    plt.xlabel('Lifetime')
    plt.title('LTV')
    # second graph: ltv dynamics
    ax2 = plt.subplot(3, 2, 2, sharey=ax1)
    # pivot table columns: all index columns except the date
    columns = [name for name in ltv_history.index.names if name not in ['dt']]
    filtered_data = ltv_history.pivot_table(
        index='dt', columns=columns, values=horizon - 1, aggfunc='mean'
    )
    filter_data(filtered_data, window).plot(grid=True, ax=ax2)
    plt.xlabel('Date of acquisition')
    plt.title('Dynamics of user LTV on day {}'.format(horizon))
    # third graph: cac dynamics
    ax3 = plt.subplot(3, 2, 5, sharey=ax1)
    # pivot table columns: all index columns except the date
    columns = [name for name in cac_history.index.names if name not in ['dt']]
    filtered_data = cac_history.pivot_table(
        index='dt', columns=columns, values='cac', aggfunc='mean'
    )
    filter_data(filtered_data, window).plot(grid=True, ax=ax3)
    plt.xlabel('Date of acquisition')
    plt.title('Dynamics of customer acquisition cost (CAC)')
    # fourth graph: roi curves
    ax4 = plt.subplot(3, 2, 3)
    roi.T.plot(grid=True, ax=ax4)
    plt.axhline(y=1, color='red', linestyle='--', label='Payback level')
    plt.legend()
    plt.xlabel('Lifetime')
    plt.title('ROI')
    # fifth graph: roi dynamics
    ax5 = plt.subplot(3, 2, 4, sharey=ax4)
    # pivot table columns: all index columns except the date
    columns = [name for name in roi_history.index.names if name not in ['dt']]
    filtered_data = roi_history.pivot_table(
        index='dt', columns=columns, values=horizon - 1, aggfunc='mean'
    )
    filter_data(filtered_data, window).plot(grid=True, ax=ax5)
    plt.axhline(y=1, color='red', linestyle='--', label='Payback level')
    plt.xlabel('Date of acquisition')
    plt.title('ROI dynamics on day {}'.format(horizon))
    plt.tight_layout()
    plt.show()
profiles = get_profiles(sessions) # call the 1st (simple) version of this function
profiles.head()
|   | user_id | first_ts | channel | dt | month |
|---|---|---|---|---|---|
| 0 | 599326 | 2019-05-07 20:58:57 | FaceBoom | 2019-05-07 | 2019-05-01 |
| 1 | 4919697 | 2019-07-09 12:46:07 | FaceBoom | 2019-07-09 | 2019-07-01 |
| 2 | 6085896 | 2019-10-01 09:58:33 | organic | 2019-10-01 | 2019-10-01 |
| 3 | 22593348 | 2019-08-22 21:35:48 | AdNonSense | 2019-08-22 | 2019-08-01 |
| 4 | 31989216 | 2019-10-02 00:07:44 | YRabbit | 2019-10-02 | 2019-10-01 |
# split profiles by acquisition channel
prof_chan = (profiles.groupby('channel').agg({'user_id': 'nunique'})
             .sort_values(by='user_id', ascending=False).reset_index()
             .rename({'user_id': 'number'}, axis='columns'))
prof_chan['%'] = round(100 * prof_chan['number'] / prof_chan['number'].sum(), 2)  # add %
display(prof_chan)
print('Total customers for the period: ', prof_chan['number'].sum())  # number of customers
|   | channel | number | % |
|---|---|---|---|
| 0 | organic | 56439 | 37.62 |
| 1 | FaceBoom | 29144 | 19.43 |
| 2 | TipTop | 19561 | 13.04 |
| 3 | OppleCreativeMedia | 8605 | 5.74 |
| 4 | LeapBob | 8553 | 5.70 |
| 5 | WahooNetBanner | 8553 | 5.70 |
| 6 | RocketSuperAds | 4448 | 2.97 |
| 7 | MediaTornado | 4364 | 2.91 |
| 8 | YRabbit | 4312 | 2.87 |
| 9 | AdNonSense | 3880 | 2.59 |
| 10 | lambdaMediaAds | 2149 | 1.43 |
Total customers for the period: 150008
# once again display the initial intervals of sessions, orders and payments for advertising
print('Sessions:',sessions['session_start'].min(),'-', sessions['session_start'].max())
print('Orders:',orders['event_dt'].min(), '-', orders['event_dt'].max())
print('Ads:',costs['dt'].min(), '-', costs['dt'].max())
Sessions: 2019-05-01 00:00:41 - 2019-10-31 23:59:23
Orders: 2019-05-01 00:28:11 - 2019-10-31 23:56:56
Ads: 2019-05-01 - 2019-10-27
# display the start and end of time interval for customers acquisition
print('Acquisitions:', profiles['first_ts'].min(),'-', profiles['first_ts'].max())
Acquisitions: 2019-05-01 00:00:41 - 2019-10-27 23:59:04
Conclusions on user profiles (without identifying payers):
# call version 2 of the profile builder to create full user profiles
# (passing events=[] is safe here because event_names is empty, so events is never read)
profiles_full = get_profiles_2(sessions, orders, [], costs, event_names=[])
profiles_full.head()
|   | user_id | first_ts | channel | device | region | dt | month | payer | acquisition_cost |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 599326 | 2019-05-07 20:58:57 | FaceBoom | Mac | United States | 2019-05-07 | 2019-05-01 | True | 1.088172 |
| 1 | 4919697 | 2019-07-09 12:46:07 | FaceBoom | iPhone | United States | 2019-07-09 | 2019-07-01 | False | 1.107237 |
| 2 | 6085896 | 2019-10-01 09:58:33 | organic | iPhone | France | 2019-10-01 | 2019-10-01 | False | 0.000000 |
| 3 | 22593348 | 2019-08-22 21:35:48 | AdNonSense | PC | Germany | 2019-08-22 | 2019-08-01 | False | 0.988235 |
| 4 | 31989216 | 2019-10-02 00:07:44 | YRabbit | iPhone | United States | 2019-10-02 | 2019-10-01 | False | 0.230769 |
# To compare users by country (and later by device and channel), we write a
# report function, which we will reuse in sections 3.3 and 3.4.
def sheet_gen(dim, prt=True):
    user_total = (profiles_full.groupby(dim)  # all users by dim (region, device, etc.)
                  .agg({'user_id': 'nunique'})
                  .reset_index())
    user_payer = (profiles_full.query('payer == True')  # paying customers by dim
                  .groupby(dim)
                  .agg({'user_id': 'nunique'})
                  .reset_index())
    table = (user_total.merge(user_payer, on=dim)  # combine the tables on the dim column
             .rename(columns={'user_id_x': 'total', 'user_id_y': 'payers'}))  # clear column names
    table['pay_%'] = round(100 * table['payers'] / table['total'], 2)       # share of payers
    table['share'] = round(100 * table['total'] / table['total'].sum(), 2)  # share of dim
    df = table.sort_values(by='payers', ascending=False)  # sort by payers, descending
    if prt:  # print the report, or just return the table
        print(f'Result table, grouping on: {dim} \n\n', df.to_string(index=False))  # hide the index
    return df  # columns from this table can be added to others
sheet_gen(dim='region'); # calling function to print the report
Result table, grouping on: region 

        region   total  payers  pay_%  share
 United States  100002    6902   6.90  66.66
            UK   17575     700   3.98  11.72
        France   17450     663   3.80  11.63
       Germany   14981     616   4.11   9.99
# add a visualization of the number of users and payers on a bar chart
yy = sheet_gen(dim='region',prt=False).set_index('region')[['total','payers']]
yy.plot(kind='bar',figsize=(4, 3))
plt.xticks(rotation = 0)
plt.title('Customers by regions')
plt.ylabel('Quantity')
plt.xlabel('Countries')
plt.show();
Conclusions on countries and payers:
sheet_gen(dim='device'); # call function to output the report
Result table, grouping on: device 

  device  total  payers  pay_%  share
  iPhone  54479    3382   6.21  36.32
 Android  35032    2050   5.85  23.35
     Mac  30042    1912   6.36  20.03
      PC  30455    1537   5.05  20.30
Conclusions by devices:
sheet_gen(dim='channel'); # call the function to print the report
Result table, grouping on: channel 

            channel  total  payers  pay_%  share
           FaceBoom  29144    3557  12.20  19.43
             TipTop  19561    1878   9.60  13.04
            organic  56439    1160   2.06  37.62
     WahooNetBanner   8553     453   5.30   5.70
         AdNonSense   3880     440  11.34   2.59
     RocketSuperAds   4448     352   7.91   2.97
            LeapBob   8553     262   3.06   5.70
 OppleCreativeMedia   8605     233   2.71   5.74
     lambdaMediaAds   2149     225  10.47   1.43
            YRabbit   4312     165   3.83   2.87
       MediaTornado   4364     156   3.57   2.91
Conclusions on channels of acquisition:
round(costs['costs'].sum()) # total expenses, rounded to the nearest whole number
105497
# aggregate expenditure data by channels, summarize and sort in descending order
a = costs.groupby('channel').agg({'costs':'sum'}).sort_values(by='costs',ascending=False)
a['%'] = round(100 * a['costs'] / a['costs'].sum()) # adding column %
b = sheet_gen(dim='channel',prt=False) # get table of paying users by channels
a.merge(b[['channel','share']],on='channel')
# the share column is each channel's share of all users (organic alone accounts for 37.62%)
|   | channel | costs | % | share |
|---|---|---|---|---|
| 0 | TipTop | 54751.30 | 52.0 | 13.04 |
| 1 | FaceBoom | 32445.60 | 31.0 | 19.43 |
| 2 | WahooNetBanner | 5151.00 | 5.0 | 5.70 |
| 3 | AdNonSense | 3911.25 | 4.0 | 2.59 |
| 4 | OppleCreativeMedia | 2151.25 | 2.0 | 5.74 |
| 5 | RocketSuperAds | 1833.00 | 2.0 | 2.97 |
| 6 | LeapBob | 1797.60 | 2.0 | 5.70 |
| 7 | lambdaMediaAds | 1557.60 | 1.0 | 1.43 |
| 8 | MediaTornado | 954.48 | 1.0 | 2.91 |
| 9 | YRabbit | 944.22 | 1.0 | 2.87 |
# make a copy of the dataframe and add columns with the year (for reference), month, and week
a = costs.copy()
a['dt'] = pd.to_datetime(a['dt'])
a['year'] = a['dt'].dt.year
a['month'] = a['dt'].dt.month
a['week'] = a['dt'].dt.isocalendar().week
a.head() # check the result
|   | dt | channel | costs | year | month | week |
|---|---|---|---|---|---|---|
| 0 | 2019-05-01 | FaceBoom | 113.3 | 2019 | 5 | 18 |
| 1 | 2019-05-02 | FaceBoom | 78.1 | 2019 | 5 | 18 |
| 2 | 2019-05-03 | FaceBoom | 85.8 | 2019 | 5 | 18 |
| 3 | 2019-05-04 | FaceBoom | 136.4 | 2019 | 5 | 18 |
| 4 | 2019-05-05 | FaceBoom | 122.1 | 2019 | 5 | 18 |
# iterate over the unique channels from costs.channel, numbering the lines so the
# legend stays readable (there are quite a lot of lines); all data could instead be
# pivoted into one table, but NaNs would appear and the lines would have gaps —
# see the pivot_table sketch after the plot
plt.figure(figsize=(16, 10))  # graph size
for num, ch in enumerate(costs.channel.unique()):
    c = a[a['channel'] == ch]  # take one channel's data from the frame a built above
    wk = c.groupby(['week']).agg({'costs': 'sum', 'dt': 'first'}).set_index('dt')[['costs']]   # costs by week
    mn = c.groupby(['month']).agg({'costs': 'sum', 'dt': 'first'}).set_index('dt')[['costs']]  # costs by month
    wk_gr = plt.plot(wk, label=f'{num}1 {ch}-week')  # distinct labels via the counter
    color = wk_gr[0].get_color()  # wk_gr[0] is the Line2D object; reuse its color
    plt.text(x=wk.index[-1], y=wk.loc[wk.index[-1], 'costs'], s=f'{num}1', c=color, weight='bold')  # tag the line end
    mn_gr = plt.plot(mn, label=f'{num}2 {ch}-month')  # same for months
    color = mn_gr[0].get_color()
    plt.text(x=mn.index[-1], y=mn.loc[mn.index[-1], 'costs'], s=f'{num}2', c=color, weight='bold')  # same for months
plt.title('Dynamics of expenses over time (by week and by month) for each source')
plt.xlabel('Time')  # axis labels and graph title
plt.ylabel('Expenses')
plt.legend()  # line labels on the graph (legend)
plt.grid()  # grid
plt.show()
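The single-table alternative mentioned in the comment above would look roughly like this (a sketch; the gaps the comment warns about would appear wherever a channel has no spend in a given week):

# weekly spend per channel in one pivot table; week/channel pairs without spend become NaN
weekly = a.pivot_table(index='week', columns='channel', values='costs', aggfunc='sum')
weekly.plot(figsize=(16, 10), grid=True, title='Weekly ad spend by channel')
plt.show()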
Conclusion from the graph of spending dynamics over time (by week and by month) for each source: spending on the TipTop and FaceBoom channels is by far the highest, and over the reporting period it generally grew.
# to answer this question, aggregate the full profiles built earlier
# and apply the arithmetic mean and the median
a = (profiles_full.groupby(['channel'])
     .agg({'acquisition_cost': ['mean', 'median']})
     .reset_index()
     )
a.columns = ['channel', 'mean', 'median']  # rename the columns
print('CAC expenses by channels, as well as average across the entire database')
a = a.sort_values(by='mean', ascending=False)  # sort by the arithmetic mean, descending
a.loc[len(a)] = ['---all_in_all---',  # append the overall mean and median as the last row
                 profiles_full['acquisition_cost'].mean(),
                 profiles_full['acquisition_cost'].median()]
a
CAC expenses by channels, as well as average across the entire database
|   | channel | mean | median |
|---|---|---|---|
| 6 | TipTop | 2.799003 | 3.000000 |
| 1 | FaceBoom | 1.113286 | 1.114667 |
| 0 | AdNonSense | 1.008054 | 1.004348 |
| 9 | lambdaMediaAds | 0.724802 | 0.733333 |
| 7 | WahooNetBanner | 0.602245 | 0.600000 |
| 5 | RocketSuperAds | 0.412095 | 0.404444 |
| 4 | OppleCreativeMedia | 0.250000 | 0.250000 |
| 8 | YRabbit | 0.218975 | 0.214286 |
| 3 | MediaTornado | 0.218717 | 0.220926 |
| 2 | LeapBob | 0.210172 | 0.210000 |
| 10 | organic | 0.000000 | 0.000000 |
| 11 | ---all_in_all--- | 0.703278 | 0.247500 |
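As a cross-check (not part of the original run), the overall mean CAC times the number of profiles should approximately reproduce the advertising budget, since each channel-day's costs were spread over that day's new users: 0.703278 * 150008 ≈ 105497, matching the total spend computed earlier.

# summing per-user CAC over all profiles should roughly return the ad budget;
# small deviations are possible if some channel-day had spend but no new users
print(round(profiles_full['acquisition_cost'].sum()))  # ≈ 105497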
Conclusions on the marketing section:
Using LTV, ROI, and CAC charts, we analyze the payback of advertising. Assume today is November 1, 2019, and the business plan requires users to pay back their acquisition cost no later than two weeks after acquisition. We also determine whether organic users need to be included in the analysis.
We will:
Analyze the payback of advertising by advertising channels. Plot LTV and ROI graphs, as well as LTV, CAC, and ROI dynamics graphs.
Extra questions:
Remark: for higher accuracy, the "organic" channel is removed from profiles_full in this section, i.e. from now on, by default, we analyze only users who arrived through a paid advertising channel.
profiles = profiles_full.query('channel != "organic"') # filter out organic users
1 - len(profiles) / len(profiles_full) # what share of the table was removed
0.3762399338701936
As expected, the table shrank by 37.62%, exactly the organic share in the channel table above.
# calculate the conversion rate (CR); show its dynamics on days 2, 7, and 14 (the function code was slightly modified for this)
conversion_raw, conversion, conversion_history = get_conversion(profiles, orders, datetime(2019, 11, 1).date(), 14)
plot_conversion(conversion, conversion_history, horizon=[2,7,14], window=15) # some horizons for 2nd graph
# now leave only the horizon of 14 days and group by country
conversion_raw, conversion, conversion_history = get_conversion(profiles, orders, datetime(2019, 11, 1).date(),
horizon_days=14, dimensions=['region'])
plot_conversion(conversion, conversion_history, horizon=[14], window=15) # printing out
# leave only horizon of 14 days and group by devices
conversion_raw, conversion, conversion_history = get_conversion(profiles, orders, datetime(2019, 11, 1).date(),
horizon_days=14, dimensions=['device'])
plot_conversion(conversion, conversion_history, horizon=[14], window=15) # print out
# leave only horizon of 14 days and group by channels
conversion_raw, conversion, conversion_history = get_conversion(profiles, orders, datetime(2019, 11, 1).date(),
horizon_days=14, dimensions=['channel'])
plot_conversion(conversion, conversion_history, horizon=[14], window=15) # print out
Conclusions on conversion charts:
# calculate the retention rate (RR); the dynamics graph shows days 2 and 14, split into payers and non-payers
retention_raw, retention, retention_history = get_retention(
profiles, sessions, datetime(2019, 11, 1).date(), 14, dimensions=[]
)
plot_retention(retention, retention_history, horizon=[2,14], window=10)
To plot retention graphs grouped by three features, we use the dedicated function plot_retention_3(). Instead of the dynamics graphs on the right, it outputs (for comparison) the same retention graphs for users of the "organic" channel. Organic users make up a significant 37.62% of the base, so analyzing them is useful, although it is not directly part of the study's task.
# passing the full user profiles, date, horizon and canvas size to the function
plot_retention_3(profiles_full, sessions, datetime(2019, 11, 1).date(), horizon=14, size=(15, 16))
Conclusions on retention graphs:
# count LTV, CAC and ROI on the 14th day for all users
ltv_raw, ltv, ltv_history, roi, roi_history = get_ltv(
profiles,
orders,
datetime(2019, 11, 1).date(),
14,
dimensions=[]
)
# graphs: the standard function was rearranged into 2 columns and the output order changed
plot_ltv_roi(ltv, ltv_history, roi, roi_history, horizon=14, window=10)
Conclusions on the analysis without group breakdowns:
# count LTV, CAC, and ROI grouped by devices that users log in from
ltv_raw, ltv, ltv_history, roi, roi_history = get_ltv(
profiles,
orders,
datetime(2019, 11, 1).date(),
14,
dimensions=['device']
)
# building graphs with same function as the previous time, averaging window is 10 days
plot_ltv_roi(ltv, ltv_history, roi, roi_history, horizon=14, window=10)
Conclusions grouped by device:
# calculate LTV, CAC and ROI by country (region)
ltv_raw, ltv, ltv_history, roi, roi_history = get_ltv(
profiles,
orders,
datetime(2019, 11, 1).date(),
14,
dimensions=['region']
)
# building graphs with increased averaging window to 15 days
plot_ltv_roi(ltv, ltv_history, roi, roi_history, horizon=14, window=15)
Conclusions grouped by country:
# the list of channels below makes the graphs easier to read; to zoom in the browser, press Ctrl and +
list(sessions['channel'].unique())
['organic', 'TipTop', 'RocketSuperAds', 'YRabbit', 'FaceBoom', 'MediaTornado', 'AdNonSense', 'LeapBob', 'WahooNetBanner', 'OppleCreativeMedia', 'lambdaMediaAds']
# count LTV, CAC and ROI broken down by user engagement channels
ltv_raw, ltv, ltv_history, roi, roi_history = get_ltv(
profiles,
orders,
datetime(2019, 11, 1).date(),
14,
dimensions=['channel']
)
# building graphs, set the averaging window at the level of 30
plot_ltv_roi(ltv, ltv_history, roi, roi_history, horizon=14, window=30)
Conclusions grouped by channel:
From the data above, we know that about two-thirds of users are in the USA; the rest come from three European countries. The graphs show how strikingly different the economic indicators of the USA and the European countries became from the summer of 2019 onward. Let's repeat the regional analysis, this time merging the three European countries into a single "EU" region.
# make a copy of the paid-channel users' dataframe and replace every non-USA region with "EU"
profiles_usaeu = profiles.copy()
profiles_usaeu.loc[profiles['region'] != 'United States', 'region'] = 'EU'
profiles_usaeu.head()
|   | user_id | first_ts | channel | device | region | dt | month | payer | acquisition_cost |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 599326 | 2019-05-07 20:58:57 | FaceBoom | Mac | United States | 2019-05-07 | 2019-05-01 | True | 1.088172 |
| 1 | 4919697 | 2019-07-09 12:46:07 | FaceBoom | iPhone | United States | 2019-07-09 | 2019-07-01 | False | 1.107237 |
| 3 | 22593348 | 2019-08-22 21:35:48 | AdNonSense | PC | EU | 2019-08-22 | 2019-08-01 | False | 0.988235 |
| 4 | 31989216 | 2019-10-02 00:07:44 | YRabbit | iPhone | United States | 2019-10-02 | 2019-10-01 | False | 0.230769 |
| 7 | 46006712 | 2019-06-30 03:46:29 | AdNonSense | Android | EU | 2019-06-30 | 2019-06-01 | True | 1.008000 |
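An equivalent one-step way to do the remap, using numpy (a sketch; np is imported at the top of the notebook):

# label every non-US profile as 'EU' in a single vectorized assignment
profiles_usaeu = profiles.copy()
profiles_usaeu['region'] = np.where(
    profiles_usaeu['region'] == 'United States', 'United States', 'EU'
)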
# count LTV, CAC and ROI by country (Europe vs. USA)
ltv_raw, ltv, ltv_history, roi, roi_history = get_ltv(
profiles_usaeu,
orders,
datetime(2019, 11, 1).date(),
14,
dimensions=['region']
)
# building graphs, increased the averaging window to 15 days
plot_ltv_roi(ltv, ltv_history, roi, roi_history, horizon=14, window=15)
Conclusions with the US/EU grouping:
Here we'll highlight the reasons for the inefficiency of attracting users, and formulate recommendations for the marketing department.
Conclusions on user profiles (without identifying payers):
Conclusions on countries and payers:
Conclusions on grouping by devices:
Conclusions on channels of acquisition:
Conclusion from the graph of spending dynamics over time (by week and by month) for each source: spending on the TipTop and FaceBoom channels is by far the highest, and over the reporting period it generally grew.
Conclusions on the marketing section:
Conclusions on conversion charts:
Conclusions on retention graphs:
Conclusions on the analysis without group breakdowns:
Conclusions grouped by device:
Conclusions grouped by country:
Conclusions grouped by channel:
Conclusions with the US/EU grouping:
Main problems and recommendations: